Skip to content

More Polars plan optimizations for TPC-DS#22395

Open
Matt711 wants to merge 3 commits into
rapidsai:mainfrom
Matt711:imp/pdsds/more-pds-optimizations
Open

More Polars plan optimizations for TPC-DS#22395
Matt711 wants to merge 3 commits into
rapidsai:mainfrom
Matt711:imp/pdsds/more-pds-optimizations

Conversation

@Matt711
Copy link
Copy Markdown
Contributor

@Matt711 Matt711 commented May 6, 2026

Description

Hand-tune polars_impl for 19 TPC-DS benchmark queries in python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/. Each rewrite preserves query semantics and only changes how the polars LazyFrame is constructed; duckdb_impl is unchanged.

The optimizations apply a small set of recurring patterns that the polars optimizer does not (yet) perform automatically:

  • Predicate pushdown on dimension tables — pre-filter date_dim, item, store, etc. by literal predicates (year, quarter, month window, category/class/brand) before any join, so the join builds smaller hash tables.
  • Semi-join fact-table pre-filtering — use selective dimension keys (and in some cases store_returns (customer, item) pairs) as semi-join probes against the fact tables, shrinking them before the expensive joins.
  • Projection pushdownselect(...) only the columns each table contributes before joining, instead of relying on the planner to prune them later.
  • Condition-join → equi-join — replace cross-join + filter and CONDITIONALJOIN-style patterns with constant-key equi-joins where the predicate is equivalent.
  • Single-pass bucket aggregation — collapse multiple independent global-sum group-bys over the same fact table into one pass that emits the values in a single aggregation, replacing N scans with 1.
  • Join reordering — defer non-selective joins (e.g. customer) until after the selective filter chain so the row count entering the deferred join is much smaller.

Test plan

  • Run TPC-DS validation against DuckDB on the 19 modified queries
  • Run benchmark sweep and confirm no regressions vs. main on unmodified queries
  • Confirm result equality (sorted output) matches DuckDB reference

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 added Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 6, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels May 6, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 6, 2026
@Matt711 Matt711 marked this pull request as ready for review May 13, 2026 18:11
@Matt711 Matt711 requested a review from a team as a code owner May 13, 2026 18:11
@Matt711 Matt711 requested a review from rjzamora May 13, 2026 18:11
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 92acf26e-76aa-4b5a-b201-8a5ca6550b0d

📥 Commits

Reviewing files that changed from the base of the PR and between e1bb3be and f1a800d.

📒 Files selected for processing (1)
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py

📝 Walkthrough

Summary by CodeRabbit

  • Refactor
    • Optimized multiple benchmark queries by pre-filtering lookup/date/item tables, using selective semi-joins, and projecting only required columns to reduce joins, shuffles, and memory overhead.
    • Preserved existing results, ordering, and output shapes while improving query selectivity and throughput.

Walkthrough

Refactors many polars_impl TPC‑DS benchmark queries to apply early filtering and column projection, use semi-joins and derived pair sets, and reorder joins/aggregations while preserving final result expressions.

Changes

Polars Query Optimization Pattern

Layer / File(s) Summary
Basic dimension pre-filtering
q8.py, q43.py, q52.py, q55.py, q76.py, q98.py, q53.py
Dimension tables (date_dim, item, store, etc.) are filtered and projected before joining to fact tables; queries use semi-joins or reduced projections rather than joining full tables then filtering.
Date window and weekday aggregation
q2.py, q14.py, q67.py
date_dim is restricted to explicit multi-year/month windows; weekday/week aggregations and week-based pipelines operate on these reduced date sets and narrower column selections.
Fact-pair derivation and semi-join qualification
q17.py, q25.py, q29.py
Derive reduced qualifying key-pair frames (e.g., (customer_sk,item_sk) or ticket keys) from filtered returns, then semi-join those to pre-filter store_sales/catalog_sales before attaching item/store dimensions.
Join reordering, bucketing, and aggregation changes
q9.py, q18.py, q23.py, q44.py, q63.py, q88.py
Larger restructures: join order changes, bucketization/threshold computation introduced, cross-join replacements, and aggregation/ranking pipelines rewritten to use earlier filtering and narrower frames.

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • rjzamora
  • TomAugspurger
  • wence-
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: hand-tuned Polars query plan optimizations for 19 TPC-DS benchmark queries.
Description check ✅ Passed The description is well-related to the changeset, explaining the optimization patterns applied and the testing plan for the 19 modified queries.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py`:
- Line 121: The ranking uses Polars' ordinal method which gives unique ranks and
diverges from SQL RANK() semantics; update both occurrences where
pl.col("avg_profit").rank(method="ordinal").alias("rnk") is used (and any
subsequent filters like rnk < 11) to use method="min" instead so tied avg_profit
values receive the same rank with gaps (matching DuckDB/SQL RANK behavior).

In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py`:
- Line 177: The aggregation pl.col("ext_sales_price").sum().alias("sales_amt")
returns 0 for all-null groups in Polars, which differs from SQL/DuckDB; replace
this with a conditional aggregation that checks
pl.col("ext_sales_price").count() > 0 and only returns the sum when count>0,
otherwise returns None, mirroring the null-sum handling used in q1/q49 so that
the "sales_amt" column matches SQL semantics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4b334632-2668-4f86-af10-2ff2b1dba11b

📥 Commits

Reviewing files that changed from the base of the PR and between ff37ba8 and e94077b.

📒 Files selected for processing (19)
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q14.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q17.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q18.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q2.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q23.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q43.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q52.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q53.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q55.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q63.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q67.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q8.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q88.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q9.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q98.py

Comment thread python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py Outdated
@Matt711 Matt711 force-pushed the imp/pdsds/more-pds-optimizations branch from e94077b to e1bb3be Compare May 14, 2026 02:02
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py (1)

121-121: ⚡ Quick win

Deduplicate the semi-join probe keys.

Same pattern here: sr_customer_item is acting as an existence set for two semi joins, so keeping duplicate (sr_customer_sk, sr_item_sk) rows only makes the join builds heavier. A .unique() should make the prefilter cheaper without changing semantics.

♻️ Proposed change
-    sr_customer_item = store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])
+    sr_customer_item = store_returns_filtered.select(
+        ["sr_customer_sk", "sr_item_sk"]
+    ).unique()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py`
at line 121, sr_customer_item currently keeps duplicate (sr_customer_sk,
sr_item_sk) rows used as an existence set for two semi-joins; make the prefilter
cheaper by deduplicating it. Update the assignment for sr_customer_item (from
store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])) to call
.unique() (or the equivalent drop_duplicates()) on the resulting frame so
duplicates are removed before using sr_customer_item in the semi-joins.
python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py (1)

126-126: ⚡ Quick win

Deduplicate the semi-join probe keys.

sr_customer_item is only used as the RHS of how="semi" joins, so duplicate (sr_customer_sk, sr_item_sk) pairs cannot change results but can still bloat both downstream join builds. Adding .unique() here should reduce the work in the hottest part of this rewrite.

♻️ Proposed change
-    sr_customer_item = store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])
+    sr_customer_item = store_returns_filtered.select(
+        ["sr_customer_sk", "sr_item_sk"]
+    ).unique()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py`
at line 126, sr_customer_item currently selects ["sr_customer_sk","sr_item_sk"]
from store_returns_filtered but is only used as the RHS of semi-joins
(how="semi"), so duplicate (sr_customer_sk, sr_item_sk) pairs only bloat
downstream joins; change the creation to deduplicate the probe keys by applying
.unique() to the selection (i.e., replace the current
store_returns_filtered.select([...]) usage for sr_customer_item with
store_returns_filtered.select([...]).unique()) so the semi-join builds operate
on distinct keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py`:
- Line 121: sr_customer_item currently keeps duplicate (sr_customer_sk,
sr_item_sk) rows used as an existence set for two semi-joins; make the prefilter
cheaper by deduplicating it. Update the assignment for sr_customer_item (from
store_returns_filtered.select(["sr_customer_sk", "sr_item_sk"])) to call
.unique() (or the equivalent drop_duplicates()) on the resulting frame so
duplicates are removed before using sr_customer_item in the semi-joins.

In `@python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py`:
- Line 126: sr_customer_item currently selects ["sr_customer_sk","sr_item_sk"]
from store_returns_filtered but is only used as the RHS of semi-joins
(how="semi"), so duplicate (sr_customer_sk, sr_item_sk) pairs only bloat
downstream joins; change the creation to deduplicate the probe keys by applying
.unique() to the selection (i.e., replace the current
store_returns_filtered.select([...]) usage for sr_customer_item with
store_returns_filtered.select([...]).unique()) so the semi-join builds operate
on distinct keys.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 800a9d03-24c5-465a-8998-3ee5af1eaa20

📥 Commits

Reviewing files that changed from the base of the PR and between e94077b and e1bb3be.

📒 Files selected for processing (19)
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q14.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q17.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q18.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q2.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q23.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q25.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q29.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q43.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q52.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q53.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q55.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q63.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q67.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q8.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q88.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q9.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q98.py
🚧 Files skipped from review as they are similar to previous changes (17)
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q76.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q52.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q67.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q53.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q17.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q8.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q9.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q98.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q88.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q14.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q43.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q23.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q63.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q55.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q44.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q2.py
  • python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/q18.py

@Matt711
Copy link
Copy Markdown
Contributor Author

Matt711 commented May 14, 2026

/ok to test f1a800d

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cudf-polars Issues specific to cudf-polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

2 participants